Personal Loan Campaign - AllLife Bank

Solution with Logistic Regression and Decision Tree Modeling

Build Logistic Regression and Decision Tree models that help the marketing department identify potential customers who have a higher probability of purchasing the loan.

Background and Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that helps the marketing department identify potential customers who have a higher probability of purchasing the loan.

Objective

Data Dictionary & Column Details

Understand Given Data

Read the given data into a data frame and understand its nature: the features provided, the total number of records, and whether it contains missing values, duplicate rows, or outliers.

Visualize the data to understand its range and outliers.

Loading necessary libraries for EDA

Load the standard Python libraries.

Data Manipulation

Data Visualization

Load data to dataframe

Read the given CSV file Loan_Modelling.csv and load it into a data frame named data.

View the first and last 5 rows of the dataset.

Understand the shape of the dataset.
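A minimal sketch of the loading and first-look steps above. The three inline rows are made-up stand-ins so the snippet runs without the file; with the real file the read would be `pd.read_csv("Loan_Modelling.csv")`.

```python
import io

import pandas as pd

# Made-up stand-in rows so the sketch runs anywhere.
# With the real file: data = pd.read_csv("Loan_Modelling.csv")
sample_csv = io.StringIO(
    "ID,Age,Experience,Income,ZIPCode,Personal_Loan\n"
    "1,25,1,49,91107,0\n"
    "2,45,19,34,90089,0\n"
    "3,39,15,11,94720,1\n"
)
data = pd.read_csv(sample_csv)

print(data.head())  # first 5 rows (fewer here, the sample is tiny)
print(data.tail())  # last 5 rows
print(data.shape)   # (number of rows, number of columns)
```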

observations on data

Check the data types of the columns in the dataset.

checking data types of all columns

observations on data

Summary of the data

observations on data

Let's check for missing values

Let's check which columns have null values and how many.
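One way to run this check with pandas, shown on a small made-up frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative frame with a few deliberately missing values.
data = pd.DataFrame(
    {
        "Age": [25, 45, 39, 35],
        "Income": [49, np.nan, 11, 100],
        "Family": [4, 3, np.nan, np.nan],
    }
)

# Count nulls per column, then keep only the columns that have any.
null_counts = data.isnull().sum()
null_report = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_report)
```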

observations

Let's check for duplicate data and remove any we find.
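A short sketch of the duplicate check on made-up rows, where the third row repeats the first:

```python
import pandas as pd

# Illustrative frame: row 2 is an exact copy of row 0.
data = pd.DataFrame({"Age": [25, 45, 25], "Income": [49, 34, 49]})

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(data.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# Keep the first occurrence of each row and rebuild the index.
data = data.drop_duplicates().reset_index(drop=True)
```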

observations

Data in category columns

Let us look at the different values in each feature.

observations

Dropping ID Column

Exploratory Data Analysis - Pre Data processing

Visualize all features before any data clean up and understand what data needs cleaning and fixing.

Initial Univariate analysis

Univariate analysis helps to check data skewness and possible outliers and spread of the data.

Create a function that plots a univariate chart with a histplot, a boxplot, and a bar chart with percentages.
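A rough sketch of such helpers, using plain matplotlib as a stand-in for seaborn's histplot. The function names `plot_univariate` and `category_percentages` are hypothetical, and the sample values are made up.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_univariate(series, bins=20):
    """Boxplot stacked above a histogram for one numeric feature."""
    values = series.dropna()
    fig, (ax_box, ax_hist) = plt.subplots(
        2, 1, sharex=True, figsize=(8, 5),
        gridspec_kw={"height_ratios": (0.25, 0.75)},
    )
    ax_box.boxplot(values, vert=False)  # shows spread and outliers
    ax_hist.hist(values, bins=bins)     # shows shape and skewness
    ax_hist.axvline(values.mean(), color="green", linestyle="--", label="mean")
    ax_hist.axvline(values.median(), color="black", linestyle="-", label="median")
    ax_hist.set_xlabel(series.name)
    ax_hist.legend()
    return fig


def category_percentages(series):
    """Share of each category in percent (the labels for a bar chart)."""
    return (series.value_counts(normalize=True) * 100).round(2)


income = pd.Series([49, 34, 11, 100, 45, 29, 72, 22, 81, 180], name="Income")
fig = plot_univariate(income)

loan = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="Personal_Loan")
pct = category_percentages(loan)
print(pct)
```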

Check all Boolean type features

Check how Personal_Loan data is distributed

Observation Personal_Loan

Check how Securities_Account data is distributed

Observation Securities_Account

Check how CD_Account data is distributed

Observation CD_Account

Check how Online data is distributed

Observation Online

Check how CreditCard data is distributed

Observation CreditCard

Check how Age data is distributed

Observation Age Data

Check how Experience data is distributed

Observation Experience

Check how Income data is distributed

Observation Income

Check how ZIPCode data is distributed

Observation ZIPCode

Check how Family data is distributed

Observation Family

Check how CCAvg data is distributed

Observation CCAvg

Check how Education data is distributed

Observation Education

Check how Mortgage data is distributed

Observation Mortgage

Insights from Initial Exploratory Data Analysis

Data Observations

Data Pre-processing & Data Cleaning, Feature conversions

Data cleaning and feature conversions based on knowledge gathered from the initial data analysis.

Treatment - missing values

There are no missing values, so no missing-value treatment is applied.

Treatment - Feature conversions

The Age and ZIP Code features need conversion.

Zip code to county and state

Find each customer's county and state from the ZIP code. The zipcodes 1.1.3 Python package is used for the lookup.

Note : Already installed zipcodes package.

Import the zipcodes library and set it up

Write a function that validates a given ZIP code and finds its county and state.
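A hedged sketch of such a function. The name `find_county_state` and the injected `lookup` argument are assumptions for illustration; a small in-memory table stands in for the package so the snippet runs without it. With the zipcodes package installed, `zipcodes.matching` could be passed as `lookup`, since it returns a list of dicts that include "county" and "state" keys.

```python
def find_county_state(zip_code, lookup):
    """Return (county, state) for a ZIP code, or (None, None) if not found.

    `lookup` is any callable that takes a 5-digit ZIP string and returns a
    list of dicts with "county" and "state" keys.
    """
    # ZIP codes read from a CSV are integers, so restore leading zeros.
    zip_str = str(zip_code).strip().zfill(5)
    if len(zip_str) != 5 or not zip_str.isdigit():
        return None, None  # not a valid 5-digit ZIP code
    matches = lookup(zip_str)
    if not matches:
        return None, None  # valid format, but unknown to the lookup
    return matches[0]["county"], matches[0]["state"]


# Hypothetical stand-in for the real lookup table.
_sample_db = {"94720": [{"county": "Alameda County", "state": "CA"}]}


def sample_lookup(zip_str):
    return _sample_db.get(zip_str, [])


print(find_county_state(94720, sample_lookup))
print(find_county_state(92717, sample_lookup))  # unknown here, so (None, None)
```

Rows that come back as (None, None) are the ones that need a manual fix afterwards.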

Observation on missing data

Observation on missing value counts

Fix missing data values

We don't have data for 4 ZIP codes, so they were searched for and filled in manually:

  1. 92717 - Orange, CA
  2. 96651 - Washington, DC
  3. 92634 - Fullerton, CA
  4. 93077 - Astoria, OR

There is now no missing data in county; all values have been found using the ZIP code.

Age - Create age bins

Experience - Fix negative values & Create Experience bins
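A possible sketch of the two binning steps. The bin edges, the labels, and the choice to treat negative experience as a sign error (taking the absolute value) are assumptions; the notebook's actual choices are not shown here. The sample values are made up.

```python
import pandas as pd

data = pd.DataFrame(
    {"Age": [23, 35, 47, 58, 67], "Experience": [-1, 10, 22, 33, 41]}
)

# Assumption: a negative experience value is a sign error.
data["Experience"] = data["Experience"].abs()

# Assumed 10-year bins; pd.cut intervals are right-inclusive, e.g. (20, 30].
age_edges = [20, 30, 40, 50, 60, 70]
age_labels = ["21-30", "31-40", "41-50", "51-60", "61-70"]
data["AgeRange"] = pd.cut(data["Age"], bins=age_edges, labels=age_labels)

# The lower edge of -1 makes sure 0 years falls into the first bin.
exp_edges = [-1, 10, 20, 30, 40, 50]
exp_labels = ["0-10", "11-20", "21-30", "31-40", "41-50"]
data["ExperienceRange"] = pd.cut(
    data["Experience"], bins=exp_edges, labels=exp_labels
)

print(data)
```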

Outlier Treatments
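The treatment actually used is not spelled out here. One common option, sketched below as an assumption, is to clip values to the IQR whiskers. The helper name `clip_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd


def clip_outliers_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those limits."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


income = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 500], name="Income")
clipped = clip_outliers_iqr(income)
print(clipped.tolist())
```

Clipping keeps every row, which matters when the positive class is small; dropping rows would be the alternative.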

Exploratory Data Analysis - Post Data processing / Data Cleaning

Visualize all features after data cleanup and understand how they relate to each other and to the target (dependent) feature.

Check the data types of the columns in the dataset.

checking data types of all columns

Summary of the data

Let's check for missing values

Let's check which columns have null values and how many.

Data in category columns - new columns

Let us look at the different values in each feature.

Univariate analysis on new features

Univariate analysis helps to check data skewness and possible outliers and spread of the data.

Check how each feature is distributed after data cleaning and how it relates to the dependent variable.

AgeRange vs Personal_Loan

Observation on AgeRange with Personal_Loan

ExperienceRange vs Personal_Loan

Observation on ExperienceRange with Personal_Loan

County vs Personal_Loan

Observation on County with Personal_Loan

Bivariate Analysis

Let's see how the features relate to each other and to the target feature.

Income vs CCAvg on Loan

Income vs CCAvg on Loan - Observations

Income vs Mortgage on Loan

Income vs Mortgage on Loan - Observations

Mortgage vs CCAvg on Loan

Mortgage vs CCAvg on Loan - Observations

Family vs Personal_Loan

observations

Education vs Personal_Loan

observations

observations

observations

observations

observations

Identify Correlation in data

Let's check how the target feature relates to the other features, and how the features relate to each other.
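A small sketch of the correlation check on synthetic data. The numbers are generated, not taken from the dataset; only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: CCAvg is built to rise with Income, loans go to high incomes.
rng = np.random.default_rng(0)
income = rng.normal(70, 30, 200)
data = pd.DataFrame(
    {
        "Income": income,
        "CCAvg": income * 0.03 + rng.normal(0, 0.5, 200),
        "Personal_Loan": (income > 100).astype(int),
    }
)

# Pairwise Pearson correlation between all numeric columns.
corr = data.corr()
print(corr.round(2))

# Correlation of every feature with the target, strongest first.
target_corr = (
    corr["Personal_Loan"].drop("Personal_Loan").sort_values(ascending=False)
)
print(target_corr.round(2))
```

The `corr` frame is what a heatmap would be drawn from.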

observations

Pair Plot

observations

Income & CCAvg and Income & Mortgage have a positive relationship; CCAvg & Mortgage show a scattered relationship.

Summary of EDA

Data Description:

Data Cleaning:

Observations from EDA:

Data Preparation & Split Data

Converting Categorical Features

Check the data types of the columns in X variable after get_dummies
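A minimal sketch of the conversion with `pd.get_dummies`. Which columns are treated as categorical, and the use of `drop_first=True`, are assumptions; the rows are made up.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "Income": [49, 34, 110, 150],
        "Education": [1, 2, 3, 1],
        "AgeRange": ["21-30", "31-40", "31-40", "41-50"],
        "Personal_Loan": [0, 0, 1, 1],
    }
)

# Separate the features from the target.
X = data.drop(columns="Personal_Loan")
y = data["Personal_Loan"]

# One-hot encode the categorical columns; drop_first avoids a redundant
# column per feature (the dropped level becomes the baseline).
X = pd.get_dummies(X, columns=["Education", "AgeRange"], drop_first=True)

print(X.dtypes)
print(X.shape)
```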

Understand the shape of the dataset.
Check the data types of the columns in the dataset.

Split data Training & Test

Now it's time to do a train test split, and train our model!

Split the data into training and test sets using train_test_split.

Let's check the split of the data in %.
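A sketch of the split on synthetic data. The 70/30 ratio, `random_state=1`, and stratifying on the target are assumptions; stratifying keeps the share of loan customers the same in both sets, which helps when the positive class is small.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 10% positives.
rng = np.random.default_rng(1)
X = pd.DataFrame(
    {"Income": rng.normal(70, 30, 1000), "CCAvg": rng.normal(2, 1, 1000)}
)
y = pd.Series((rng.random(1000) < 0.1).astype(int), name="Personal_Loan")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Class balance in each set, in percent.
print(y_train.value_counts(normalize=True) * 100)
print(y_test.value_counts(normalize=True) * 100)
```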

Model building - Logistic Regression

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting a customer is eligible for a loan when in reality the customer is not - Loss of money for the bank
  2. Predicting a customer is not eligible for a loan when in reality the customer is eligible - Loss of opportunity & revenue loss.

Which case is more important?

How to reduce this loss

Create functions to calculate different metrics and confusion matrix
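One way such helpers might look. The names `classification_metrics` and `confusion_counts` are hypothetical, and the labels and probabilities are made up. Both take predicted probabilities and a threshold so they can be reused when the threshold is tuned later.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, recall, precision and F1 at a given probability threshold."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return pd.DataFrame(
        {
            "Accuracy": [accuracy_score(y_true, y_pred)],
            "Recall": [recall_score(y_true, y_pred)],
            "Precision": [precision_score(y_true, y_pred)],
            "F1": [f1_score(y_true, y_pred)],
        }
    )


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Confusion matrix as a labelled data frame."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return pd.DataFrame(
        cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"]
    )


y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.6, 0.8, 0.3, 0.2, 0.9, 0.05]

metrics = classification_metrics(y_true, y_prob)
cm = confusion_counts(y_true, y_prob)
print(metrics)
print(cm)
```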

Training and Predicting

Logistic Regression

Train and fit a logistic regression model on the training set.

Model performance evaluation and improvement

Now predict values for the test data and create a classification report for the model.
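A minimal sketch of fitting and evaluating the model. It uses a synthetic imbalanced dataset from `make_classification` in place of the bank data, and the solver settings (`max_iter=1000`, `random_state=1`) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in: about 10% positives, like a loan campaign.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Train on the training set only.
log_reg = LogisticRegression(max_iter=1000, random_state=1)
log_reg.fit(X_train, y_train)

# Predict on unseen data and summarise per-class precision and recall.
y_pred = log_reg.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)
```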

Checking model performance on training & test set

Checking model performance on training set

Checking performance on test set

ROC-AUC

Model Improvement Opportunities

Optimal threshold using AUC-ROC curve
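A sketch of one common way to pick this threshold: choose the point on the ROC curve where the true positive rate minus the false positive rate is largest (Youden's J). Whether the notebook uses exactly this rule is an assumption, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
log_reg = LogisticRegression(max_iter=1000, random_state=1).fit(X_train, y_train)

# Probability of the positive class on the training set.
y_prob = log_reg.predict_proba(X_train)[:, 1]
auc = roc_auc_score(y_train, y_prob)

# One (fpr, tpr) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_train, y_prob)

# Threshold with the largest gap between TPR and FPR.
best_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[best_idx]
print(f"AUC: {auc:.3f}, threshold: {optimal_threshold:.3f}")
```

With a rare positive class this threshold usually lands well below 0.5, trading precision for recall.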

Checking model performance on training & test set

Let's use Precision-Recall curve and see if we can find a better threshold
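A sketch of reading a threshold off the precision-recall curve. Picking the point where precision and recall are closest is one common choice and an assumption here; the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
log_reg = LogisticRegression(max_iter=1000, random_state=1).fit(X_train, y_train)
y_prob = log_reg.predict_proba(X_train)[:, 1]

# precision and recall have one more entry than thresholds,
# so the last point is dropped before comparing.
precision, recall, thresholds = precision_recall_curve(y_train, y_prob)
gap = np.abs(precision[:-1] - recall[:-1])

# Threshold where the two curves are closest to crossing.
pr_threshold = thresholds[np.argmin(gap)]
print(f"Threshold where precision and recall meet: {pr_threshold:.3f}")
```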

Model Performance Summary

Let's compare all the results and see which threshold gives the better outcome.

observations

Enhancements - Sequential Feature Selector

Enhance the model with forward feature selection using SequentialFeatureSelector and see whether we can find a better model.

Selecting subset of important features using Sequential Feature Selector method
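A hedged sketch of forward selection. It uses scikit-learn's `SequentialFeatureSelector`; mlxtend ships a class with the same name, and which one the notebook uses is not stated. The number of features to keep, the `recall` scoring, and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Forward selection: start empty and add the feature that most improves
# the cross-validated score, until 4 features are chosen.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="forward",
    scoring="recall",
    cv=3,
)
sfs.fit(X_train, y_train)

selected = sfs.get_support(indices=True)  # column positions that were kept
X_train_sel = sfs.transform(X_train)
X_test_sel = sfs.transform(X_test)
print("Selected feature indices:", selected)
```

A new model is then fitted on `X_train_sel` and compared with the full-feature model.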

Observation

Let's build a model with the top 20 features and see how the results look.

Columns used for this Model

Conclusion on Logistic Regression Model

Finding the coefficients on the best model

Important Coefficient interpretations
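A sketch of turning coefficients into something readable. Exponentiating a logistic regression coefficient gives the odds ratio for a one-unit increase in that feature, holding the others fixed. The feature names and data below are synthetic placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholders
log_reg = LogisticRegression(max_iter=1000, random_state=1).fit(X, y)

coef = log_reg.coef_[0]
coef_table = pd.DataFrame(
    {
        "coef": coef,
        "odds_ratio": np.exp(coef),                    # > 1 raises the odds
        "pct_change_in_odds": (np.exp(coef) - 1) * 100,
    },
    index=feature_names,
).sort_values("odds_ratio", ascending=False)

print(coef_table.round(3))
```

An odds ratio of 1.5 reads as "each extra unit raises the odds of taking the loan by 50%".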

Model building - Decision Tree

Build Decision Tree Model
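A minimal sketch of a default tree on synthetic data. With no depth limit the tree keeps splitting until it fits the training set perfectly, which is why the train and test scores differ.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Default settings: no depth limit, so the tree grows until leaves are pure.
tree = DecisionTreeClassifier(random_state=1)
tree.fit(X_train, y_train)

train_recall = recall_score(y_train, tree.predict(X_train))
test_recall = recall_score(y_test, tree.predict(X_test))
print(f"Recall - train: {train_recall:.3f}, test: {test_recall:.3f}")
```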

Checking model performance on training set

Checking model performance on test set

observation on default model performance

Visualizing the Decision Tree

Text report - decision tree -

important features
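A sketch of the text report and the feature importances. `max_depth=3` is set only to keep the printout short, and the feature names are placeholders for the synthetic data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholders

# Shallow tree so the text report stays readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

# Indented if/else rules, one line per node.
tree_rules = export_text(tree, feature_names=feature_names)
print(tree_rules)

# Share of the total impurity reduction credited to each feature.
importances = pd.Series(
    tree.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.round(3))
```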

observations

Reducing overfitting

Using GridSearch for Hyperparameter tuning of our tree model
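A sketch of the tuning step with `GridSearchCV`. The parameter grid, the `recall` scoring, and the 5-fold setting are assumptions chosen for illustration; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Candidate settings that limit how far the tree can grow.
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}

# Every combination is scored with 5-fold cross-validation on recall.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1), param_grid, scoring="recall", cv=5
)
grid.fit(X_train, y_train)

best_tree = grid.best_estimator_  # refitted on the full training set
print("Best parameters:", grid.best_params_)
print(f"Best cross-validated recall: {grid.best_score_:.3f}")
```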

observations

Checking performance on training set

Checking performance on test set

observations

Visualizing the Decision Tree

important features

Key Features

observations on important features on tuning

Cost Complexity Pruning

F1 and Recall vs alpha for training and testing sets
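A sketch of cost complexity pruning on synthetic data: compute the pruning path, fit one tree per alpha, and record recall and F1 on both sets. Picking the alpha by test recall is an assumption that mirrors the heading above; a separate validation split would avoid tuning on the test set.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Effective alphas at which the fully grown tree loses a subtree.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
# Drop the last alpha (it prunes down to a single node) and guard against
# tiny negative values caused by floating-point error.
ccp_alphas = np.clip(path.ccp_alphas[:-1], 0, None)

rows = []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    rows.append(
        {
            "ccp_alpha": alpha,
            "train_recall": recall_score(y_train, tree.predict(X_train)),
            "test_recall": recall_score(y_test, tree.predict(X_test)),
            "train_f1": f1_score(y_train, tree.predict(X_train)),
            "test_f1": f1_score(y_test, tree.predict(X_test)),
        }
    )
results = pd.DataFrame(rows)

best_alpha = results.loc[results["test_recall"].idxmax(), "ccp_alpha"]
print(results.round(3))
print(f"Alpha with the best test recall: {best_alpha:.5f}")
```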

Checking model performance on training set

Checking model performance on test set

Comparing all the decision tree models

Conclusion on Decision Tree Models

Visualizing best model Decision Tree

important features on Best Model

Conclusions

Actionable Insights & Recommendations

AllLife Bank should focus on finding more potential customers and offering them personal loans. Last year's campaign converted just over 9% of customers, but the bank has more potential customers to reach.

Bank should reach
Bank should avoid